1 Analysis and implementation of software rejuvenation in cluster systems
Kalyanaraman Vaidyanathan, Richard E. Harper, Steven W. Hunter, Kishor S. Trivedi 
June 2001 ACM SIGMETRICS Performance Evaluation Review , Proceedings of the
2001 ACM SIGMETRICS international conference on Measurement and
modeling of computer systems SIGMETRICS '01, Volume 29 Issue 1
Publisher: ACM Press 

Full text available: pdf(983.05 KB) Additional Information: full citation , abstract , references , citings

Several recent studies have reported the phenomenon of "software aging", one in which 
the state of a software system degrades with time. This may eventually lead to 
performance degradation of the software or crash/hang failure or both. "Software 
rejuvenation" is a pro-active technique aimed to prevent unexpected or unplanned outages 
due to aging. The basic idea is to stop the running software, clean its internal state and 
restart it. In this paper, we discuss software rejuvenation as applied to ... 

2 Quantifying and Improving the Availability of High-Performance Cluster-Based
Internet Services 

Kiran Nagaraja, Neeraj Krishnan, Ricardo Bianchini, Richard P. Martin, Thu D. Nguyen 
November 2003 Proceedings of the 2003 ACM/IEEE conference on Supercomputing 
Publisher: IEEE Computer Society 

Full text available: pdf(306.01 KB) Additional Information: full citation , abstract

Cluster-based servers can substantially increase performance when nodes cooperate to 
globally manage resources. However, in this paper we show that cooperation results in a 
substantial availability loss, in the absence of high-availability mechanisms. Specifically, 
we show that a sophisticated cluster-based Web server, which gains a factor of 3 in 
performance through cooperation, increases service unavailability by a factor of 10 over a 
non-cooperative version. We then show how to augment this W ... 

3 Performance and dependability evaluation of scalable massively parallel computer
systems with conjoint simulation 
Axel Hein, Mario Dal Cin 

October 1998 ACM Transactions on Modeling and Computer Simulation (TOMACS),
Volume 8 Issue 4
Publisher: ACM Press 

Full text available: pdf(501.59 KB) Additional Information: full citation , abstract , references , citings , index terms , review

Computer systems are becoming more and more a part of our daily life; business and
industry rely on their service, and the health of human beings depends on their correct
functioning. Computer systems used for critical tasks have to be carefully designed and 
tested during the early design stage, the prototype phase, and their operational life. 
Methods and tools are required to support and facilitate this vital task. In this article, we 
tackle the issue of system-level performance and depen ... 

Keywords: fault-tolerant and large-scale computer systems, hierarchical model design, 
object-oriented modeling, process-based simulation, timed Petri nets

4 Improving cluster availability using workstation validation
Taliver Heath, Richard P. Martin, Thu D. Nguyen 

June 2002 ACM SIGMETRICS Performance Evaluation Review , Proceedings of the
2002 ACM SIGMETRICS international conference on Measurement and
modeling of computer systems SIGMETRICS '02, Volume 30 Issue 1
Publisher: ACM Press

Full text available: pdf(201.72 KB) Additional Information: full citation , abstract , references

We demonstrate a framework for improving the availability of cluster based Internet 
services. Our approach models Internet services as a collection of interconnected 
components, each possessing well defined interfaces and failure semantics. Such a 
decomposition allows designers to engineer high availability based on an understanding of 
the interconnections and isolated fault behavior of each component, as opposed to ad-hoc 
methods. In this work, we focus on using the entire commodity workstation ... 

5 On the handoff arrival process in cellular communications
Philip V. Orlik, Stephen S. Rappaport

March 2001 Wireless Networks, volume 7 issue 2 

Publisher: Kluwer Academic Publishers 

Full text available: pdf(156.96 KB) Additional Information: full citation , references , index terms



Keywords: cellular communications, handoffs handovers, telecommunications traffic 
performance 



6 Wide area traffic: the failure of Poisson modeling 
Vern Paxson, Sally Floyd 

June 1995 IEEE/ACM Transactions on Networking (TON), Volume 3 Issue 3
Publisher: IEEE Press 

Full text available: pdf(2.18 MB) Additional Information: full citation , references , citings , index terms



7 Fastpath Optimizations for Cluster Recovery in Shared-Disk Systems 
Randal Burns 

November 2004 Proceedings of the 2004 ACM/IEEE conference on Supercomputing 
Publisher: IEEE Computer Society 

Full text available: pdf(176.70 KB) Additional Information: full citation , abstract

We describe the design and implementation of a clustering service for a high-performance, 
shared-disk file system. The service provides failure detection and recovery, reliable
end-to-end messaging, and a centralized and recoverable management interface. We
implement novel optimizations in the voting protocol that resolves cluster membership. 
Optimizations allow clusters to form as quickly as possible without introducing livelock or
requiring timeout parameters to be tuned carefully. Our treatmen ...

8 Simulation potpourri: Simulation model of the cable data network for the analysis and
evaluation of network performance 
D. Gan, R. Paterson 

December 1982 Proceedings of the 14th conference on Winter Simulation - Volume 2 
Publisher: Winter Simulation Conference 

Full text available: pdf(1.28 MB) Additional Information: full citation , abstract

A Cable Data Network (CDN) simulation model was developed on VAX 11/780 computer 
facility in PASCAL as a part of the MX-C³ system study. Its primary purpose
was to supplement theoretical analysis and to evaluate the impact of changing the CDN 
(sub)system requirements on the performance measured primarily in terms of network 
reaction time and queue (buffer) buildup at the CDN nodes. The validated simulation 
model provided a powerful tool in rapidly determining the quantitati ... 

9 Partition testing, stratified sampling, and cluster analysis
Andy Podgurski, Charles Yang 

December 1993 ACM SIGSOFT Software Engineering Notes , Proceedings of the 1st 
ACM SIGSOFT symposium on Foundations of software engineering 
SIGSOFT '93, Volume 18 Issue 5 
Publisher: ACM Press 

Full text available: pdf(1.35 MB) Additional Information: full citation , abstract , references , citings , index terms

We present a new approach to reducing the manual labor required to estimate software 
reliability. It combines the ideas of partition testing methods with those of stratified 
sampling to reduce the sample size necessary to estimate reliability with a given degree of 
precision. Program executions are stratified by using automatic cluster analysis to group 
those with similar features. We describe the conditions under which stratification is 
effective for estimating softw ... 

10 Method for distributed transaction commit and recovery using Byzantine Agreement 
within clusters of processors 
C. Mohan, R. Strong, S. Finkelstein 

July 1985 ACM SIGOPS Operating Systems Review, Volume 19 Issue 3
Publisher: ACM Press 

Full text available: pdf(1.11 MB) Additional Information: full citation , abstract , references

This paper describes an application of Byzantine Agreement [DoSt82a, DoSt82c, LyFF82]
to distributed transaction commit. We replace the second phase of one of the commit
algorithms of [MoLi83] with Byzantine Agreement, providing certain trade-offs and
advantages at the time of commit and providing speed advantages at the time of recovery 
from failure. The present work differs from that presented in [DoSt82b] by increasing the 
scope (handling a general tree of processes, and multi-cluster transac ... 



11 Method for distributed transaction commit and recovery using Byzantine Agreement 
within clusters of processors 
C. Mohan, R. Strong, S. Finkelstein 

August 1983 Proceedings of the second annual ACM symposium on Principles of 
distributed computing 

Publisher: ACM Press 

Full text available: pdf(939.80 KB) Additional Information: full citation , abstract , references , citings , index terms

This paper describes an application of Byzantine Agreement [DoSt82a, DoSt82c, LyFF82] 
to distributed transaction commit. We replace the second phase of one of the commit
algorithms of [MoLi83] with Byzantine Agreement, providing certain trade-offs and
advantages at the time of commit and providing speed advantages at the time of recovery 
from failure. The present work differs from that presented in [DoSt82b] by increasing the 
scope (handling a general tree of processes, and multi-cluster tr ... 

12 Technical papers: consistency management and quality assurance: Automated 
support for classifying software failure reports 

Andy Podgurski, David Leon, Patrick Francis, Wes Masri, Melinda Minch, Jiayang Sun, Bin 
Wang 

May 2003 Proceedings of the 25th International Conference on Software 
Engineering 

Publisher: IEEE Computer Society 

Full text available: pdf , Publisher Site Additional Information: full citation , abstract , references , citings , index terms

This paper proposes automated support for classifying reported software failures in order 
to facilitate prioritizing them and diagnosing their causes. A classification strategy is 
presented that involves the use of supervised and unsupervised pattern classification and 
multivariate visualization. These techniques are applied to profiles of failed executions in 
order to group together failures with the same or similar causes. The resulting 
classification is then used to assess the frequency and s ... 



13 Machine Learning Methods for Predicting Failures in Hard Drives: A Multiple-Instance
Application 

Joseph F. Murray, Gordon F. Hughes, Kenneth Kreutz-Delgado 
September 2005 The Journal of Machine Learning Research, Volume 6 
Publisher: MIT Press 

Full text available: pdf(274.51 KB) Additional Information: full citation , abstract

We compare machine learning methods applied to a difficult real-world problem: predicting 
computer hard-drive failure using attributes monitored internally by individual drives. The 
problem is one of detecting rare events in a time series of noisy and nonparametrically- 
distributed data. We develop a new algorithm based on the multiple-instance learning 
framework and the naive Bayesian classifier (mi-NB) which is specifically designed for the 
low false-alarm case, and is shown to have promising p ... 



14 Yield modeling and BEOL fundamentals 
Jose Pineda de Gyvez 

March 2001 Proceedings of the 2001 international workshop on System-level 
interconnect prediction 

Publisher: ACM Press 

Full text available: pdf(850.38 KB) Additional Information: full citation , abstract , references , citings , index terms

The advent of deep submicron technologies with larger die sizes lends itself to an increase 
in fabrication cost. An appropriate yield forecast renders significant benefits in both time- 
to-market and manufacturing cost prediction. Yield forecasting is essential for the 
development of new products as it effectively shows if a design is feasible of meeting its 
cost objectives or not. In mature manufacturing processes, spot defects are the main 
detractors in the successful outcome of an IC. The ... 



15 A failure and overload tolerance mechanism for continuous media servers 
Rajesh Krishnan, Dinesh Venkatesh, Thomas D. C. Little 

November 1997 Proceedings of the fifth ACM international conference on Multimedia 
Publisher: ACM Press 

Full text available: Additional Information:

Keywords: caching, clustered video servers, content insertion, fault tolerance, interactive
video-on-demand, overload tolerance, rate adaptive stream merging, stream clustering 



16 A checkpoint protocol for an entry consistent shared memory system 
Nuno Neves, Miguel Castro, Paulo Guedes 

August 1994 Proceedings of the thirteenth annual ACM symposium on Principles of 
distributed computing 

Publisher: ACM Press 

Full text available: pdf(1.09 MB) Additional Information: full citation , references , citings , index terms



17 Development of generic simulation models to evaluate wafer fabrication cluster tools 
Neal G. Pierce, Michael J. Drevna 

December 1992 Proceedings of the 24th conference on Winter simulation 
Publisher: ACM Press 

Full text available: pdf(449.87 KB) Additional Information: full citation , references , citings , index terms



18 Industry/government track papers: Effective localized regression for damage 
detection in large complex mechanical structures 
Aleksandar Lazarevic, Ramdev Kanapady, Chandrika Kamath 

August 2004 Proceedings of the tenth ACM SIGKDD international conference on 
Knowledge discovery and data mining KDD '04 

Publisher: ACM Press 

Full text available: pdf(597.35 KB) Additional Information: full citation , abstract , references , index terms

In this paper, we propose a novel data mining technique for the efficient damage detection 
within the large-scale complex mechanical structures. Every mechanical structure is 
defined by the set of finite elements that are called structure elements. Large-scale 
complex structures may have extremely large number of structure elements, and 
predicting the failure in every single element using the original set of natural frequencies 
as features is exceptionally time-consuming task. Traditional data m ... 

Keywords: clustering, damage detection, localized regression, mechanical structures, 
structure elements 



19 Measurement and modeling of computer reliability as affected by system activity 
R. K. Iyer, D. J. Rossetti, M. C. Hsueh 

August 1986 ACM Transactions on Computer Systems (TOCS), Volume 4 issue 3 
Publisher: ACM Press 

Full text available: pdf(1.44 MB) Additional Information: full citation , abstract , references , citings , index terms , review

This paper demonstrates a practical approach to the study of the failure behavior of 
computer systems. Particular attention is devoted to the analysis of permanent failures. A 
number of important techniques, which may have general applicability in both failure and 
workload analysis, are brought together in this presentation. These include: smeared 
averaging of the workload data, clustering of like failures, and joint analysis of workload 
and failures. Approximately 17 percent of all failure ...

20 Efficient estimation of the mean time between failures in non-regenerative
dependability models
Peter W. Glynn, Philip Heidelberger, Victor F. Nicola, Perwez Shahabuddin

December 1993 Proceedings of the 25th conference on Winter simulation 

Publisher: ACM Press 

Full text available: pdf(627.98 KB) Additional Information: full citation , references , citings

21 A framework for time indexing in sensor networks
Guanghui He, Rong Zheng, Indranil Gupta, Lui Sha 

August 2005 ACM Transactions on Sensor Networks (TOSN), Volume 1 Issue 1
Publisher: ACM Press 

Full text available: pdf(1.24 MB) Additional Information: full citation , abstract , references , index terms

In this article, we define the time-indexing problem as the in-network storage and 
querying of sensor network data based solely on the time attribute. We argue qualitatively 
why existing storage schemes may be insufficient as solutions. We then present, analyze, 
and evaluate novel and lightweight solutions to both the storage and the querying 
subproblems for time indexing. First, the time-indexed storage problem is formally defined 
and two formulations are presented seeking to optimize ge ... 



Keywords: Time indexing, information retrieval, rendezvous point 



22 Cluster-based scalable network services
Armando Fox, Steven D. Gribble, Yatin Chawathe, Eric A. Brewer, Paul Gauthier 
October 1997 ACM SIGOPS Operating Systems Review , Proceedings of the sixteenth
ACM symposium on Operating systems principles SOSP '97, Volume 31 Issue 5

Publisher: ACM Press 

Full text available: pdf(2.42 MB) Additional Information: full citation , references , citings , index terms



23 E-textiles: Challenges and opportunities in electronic textiles modeling and 
optimization 

Diana Marculescu, Radu Marculescu, Pradeep K. Khosla 
June 2002 Proceedings of the 39th conference on Design automation 

Publisher: ACM Press 

Full text available: pdf(769.90 KB) Additional Information: full citation , abstract , references , citings , index terms

This paper addresses an emerging new field of research that combines the strengths and 
capabilities of electronics and textiles in one: electronic textiles, or e-textiles. E-textiles, 
also called Smart Fabrics, have not only "wearable" capabilities like any other garment, 
but also local monitoring and computation, as well as wireless communication capabilities. 
Sensors and simple computational elements are embedded in e-textiles, as well as built
into yarns, with the goal of gathering sensitive ...

24 Active learning for automatic classification of software behavior 
James F. Bowring, James M. Rehg, Mary Jean Harrold 

July 2004 ACM SIGSOFT Software Engineering Notes , Proceedings of the 2004 ACM 
SIGSOFT international symposium on Software testing and analysis ISSTA 
'04, Volume 29 Issue 4 
Publisher: ACM Press 

Full text available: pdf(567.57 KB) Additional Information: full citation , abstract , references , index terms

A program's behavior is ultimately the collection of all its executions. This collection is 
diverse, unpredictable, and generally unbounded. Thus it is especially suited to statistical 
analysis and machine learning techniques. The primary focus of this paper is on the 
automatic classification of program behavior using execution data. Prior work on classifiers 
for software engineering adopts a classical batch-learning approach. In contrast, we 
explore an active-learning paradigm for ... 

Keywords: Markov models, machine learning, software behavior, software testing 



25 Evaluation of cluster tool throughput for thin film head production 
Eric J. Koehler, Timbur M. Wulf, Alvin C. Bruska, Marvin S. Seppanen 

December 1999 Proceedings of the 31st conference on Winter simulation: Simulation---a
bridge to the future - Volume 1
Publisher: ACM Press 

Full text available: pdf(83.62 KB) Additional Information: full citation , references , citings , index terms



26 High quality behavioral verification using statistical stopping criteria 
A. Hajjar, T. Chen, I. Munn, A. Andrews, M. Bjorkman 

March 2001 Proceedings of the conference on Design, automation and test in Europe 
Publisher: IEEE Press 

Full text available: pdf(143.66 KB) Additional Information: full citation , references , citings , index terms



Keywords: VHDL, behavioral model verification, statistical stopping rules 



27 Redundancy in model specifications for discrete event simulation 
Richard E. Nance, C. Michael Overstreet, Ernest H. Page 

July 1999 ACM Transactions on Modeling and Computer Simulation (TOMACS), Volume
9 Issue 3
Publisher: ACM Press 

Full text available: pdf(295.90 KB) Additional Information: full citation , abstract , references , citings , index terms

Although redundancy in model specification generally has negative connotations, we offer 
arguments for revising those convictions. Defining "representational redundancy" as the 
inclusion of any symbols not required to fulfill the study objectives, we cite several sources 
of redundancy, classified as accidental or intentional, that contribute positively to the 
model development tasks. Comparative benefits and detriments are discussed briefly. 
Focusing on the most interesting sour ... 

Keywords: discrete event simulation, model analysis, model development environment, 
uses of redundancy

28 Simulation model decomposition by factor analysis
Kenneth W. Bauer, Bipin Kochar, Joseph J. Talavage 

December 1985 Proceedings of the 17th conference on Winter simulation 
Publisher: ACM Press 

Full text available: pdf(319.81 KB) Additional Information: full citation , abstract , references , citings 

This paper offers a solution to the simulation model decomposition problem discussed 
briefly in Overstreet and Nance [1] and elaborated in detail in Overstreet [2]. The solution 
scheme involves the use of principal components analysis. We offer an example of the 
technique on a simple directed graph and then demonstrate the method on a small model 
given in Overstreet [2]. 

29 The pebble crunching model for load balancing in concurrent hypercube ensembles
J. Barhen, S. Gulati, S. S. Iyengar 

January 1988 Proceedings of the third conference on Hypercube concurrent
computers and applications: Architecture, software, computer systems,
and general issues - Volume 1
Publisher: ACM Press 

Full text available: pdf(1.36 MB) Additional Information: full citation , abstract , references , citings , index terms

The successful development of fifth generation systems require enormous computational 
capability and flexibility necessitating the ability to achieve operational responses in hard 
real-time through optimal resource utilization. This entails dynamically balancing the 
computational load among all the processing nodes in the system. We propose a graph- 
theoretic, receiver-initiated, distributed protocol for dynamic load balancing protocol in 
large-scale hypercube ensembles. Using attributed hyp ... 

30 Research sessions: consistency and availability: Highly available, fault-tolerant, 
parallel dataflows 

Mehul A. Shah, Joseph M. Hellerstein, Eric Brewer 
June 2004 Proceedings of the 2004 ACM SIGMOD international conference on
Management of data
Publisher: ACM Press 

Full text available: pdf(210.17 KB) Additional Information: full citation , abstract , references , citings

We present a technique that masks failures in a cluster to provide high availability and 
fault-tolerance for long-running, parallelized dataflows. We can use these dataflows to 
implement a variety of continuous query (CQ) applications that require high-throughput, 
24x7 operation. Examples include network monitoring, phone call processing, click-stream 
processing, and online financial analysis. Our main contribution is a scheme that carefully 
integrates traditional query processing techniques for ... 

31 A taxonomy of wireless micro-sensor network models 
Sameer Tilak, Nael B. Abu-Ghazaleh, Wendi Heinzelman

April 2002 ACM SIGMOBILE Mobile Computing and Communications Review, Volume 6
Issue 2
Publisher: ACM Press 

Full text available: pdf(66.31 KB) Additional Information: full citation , abstract , references , citings , index terms

In future smart environments, wireless sensor networks will play a key role in sensing, 
collecting, and disseminating information about environmental phenomena. Sensing 
applications represent a new paradigm for network operation, one that has different goals 
from more traditional wireless networks. This paper examines this emerging field to
classify wireless micro-sensor networks according to different communication functions,
data delivery models, and network dynamics. This taxonomy will aid in ... 

32 Computing the performability of layered distributed systems with a management 
architecture 

Olivia Das, C. Murray Woodside 

January 2004 ACM SIGSOFT Software Engineering Notes , Proceedings of the 4th
international workshop on Software and performance WOSP '04, Volume
29 Issue 1
Publisher: ACM Press 

Full text available: pdf(942.77 KB) Additional Information: full citation , abstract , references

This paper analyzes the performability of client-server applications that use a separate 
fault management architecture for monitoring and controlling of the status of the 
application software and hardware. The analysis considers the impact of the management 
components and connections, and their reliability, on performability. The approach 
combines minpath algorithms, Layered Queueing analysis and non-coherent fault tree 
analysis techniques for efficient computation of expected reward rate of the ... 

Keywords: distributed systems, layered queueing networks, non-coherent fault trees, 
performability, system fault-tolerance 




33 A Self-Organizing Storage Cluster for Parallel Data-Intensive Applications 
Hong Tang, Aziz Gulbeden, Jingyu Zhou, William Strathearn, Tao Yang, Lingkun Chu 
November 2004 Proceedings of the 2004 ACM/IEEE conference on Supercomputing 
Publisher: IEEE Computer Society 

Full text available: pdf(330.26 KB) Additional Information: full citation , abstract

Cluster-based storage systems are popular for data-intensive applications and it is 
desirable yet challenging to provide incremental expansion and high availability while 
achieving scalability and strong consistency. This paper presents the design and 
implementation of a self-organizing storage cluster called Sorrento, which targets data- 
intensive workload with highly parallel requests and low write-sharing patterns. Sorrento 
automatically adapts to storage node joins and departures, and the sys ... 

34 Finding Latent Code Errors via Machine Learning over Program Executions 
Yuriy Brun, Michael D. Ernst 

May 2004 Proceedings of the 26th International Conference on Software 
Engineering 

Publisher: IEEE Computer Society 

Full text available: pdf(183.04 KB) Additional Information: full citation , abstract , citings

This paper proposes a technique for identifying program properties that indicate errors. The
technique generates machine learning models of program properties known to result from
errors, and applies these models to program properties of user-written code to classify and
rank properties that may lead the user to errors. Given a set of properties produced by the
program analysis, the technique selects a subset of properties that are most likely to reveal
an error. An implementation, the Fault Invariant Cla ...

35 A scalable, robust network for parallel computing
Peter Cappello, Dimitros Mourloukos 

June 2001 Proceedings of the 2001 joint ACM-ISCOPE conference on Java Grande 

Publisher: ACM Press 

Full text available: pdf(822.74 KB) Additional Information: full citation , abstract , references , index terms

CX, a network-based computational exchange, is presented. The system's design
integrates variations of ideas from other researchers, such as work stealing, non-blocking
tasks, eager scheduling, and space-based coordination. The object-oriented API is simple, 
compact, and cleanly separates application logic from the logic that supports interprocess 
communication and fault tolerance. Computations, of course, run to completion in the 
presence of computational hosts that join and leave the ongoin ... 

Keywords: Java, network computing, parallel processing, robust, scalable 



36 Industry/government track paper: Dynamic syslog mining for network failure 
monitoring 

Kenji Yamanishi, Yuko Maruyama 

August 2005 Proceeding of the eleventh ACM SIGKDD international conference on 
Knowledge discovery in data mining KDD '05 

Publisher: ACM Press 

Full text available: pdf(684.40 KB) Additional Information: full citation , abstract , references , index terms

Syslog monitoring technologies have recently received vast attentions in the areas of 
network management and network monitoring. They are used to address a wide range of 
important issues including network failure symptom detection and event correlation 
discovery. Syslogs are intrinsically dynamic in the sense that they form a time series and 
that their behavior may change over time. This paper proposes a new methodology of 
dynamic syslog mining in order to detect failure symptoms w ... 

Keywords: correlation analysis, failure detection, model selection, probabilistic modeling, 
syslog mining 




37 A characterization of the simple failure-biasing method for simulations of highly
reliable Markovian Systems
Marvin K. Nakayama 

January 1994 ACM Transactions on Modeling and Computer Simulation (TOMACS),
Volume 4 Issue 1
Publisher: ACM Press 

Full text available: pdf(2.25 MB) Additional Information: full citation , abstract , references , citings , index terms

Simple failure biasing is an importance-sampling technique used to reduce the variance of 
estimates of performance measures and their gradients in simulations of highly reliable 
Markovian systems. Although simple failure biasing yields bounded relative error for the 
performance measure estimate when the system is balanced, it may not provide bounded 
relative error when the system is unbalanced. In this article, we provide a characterization 
of when the simple failure-biasing meth ... 

Keywords: balanced failure biasing, gradient estimation, highly reliable systems, 
importance sampling, likelihood ratios, simple failure biasing 



38 Capturing, indexing, clustering, and retrieving system history 

Ira Cohen, Steve Zhang, Moises Goldszmidt, Julie Symons, Terence Kelly, Armando Fox 
October 2005 ACM SIGOPS Operating Systems Review , Proceedings of the twentieth
ACM symposium on Operating systems principles SOSP '05, Volume 39 Issue 5

Publisher: ACM Press 

Full text available: pdf(516.41 KB) Additional Information: full citation , abstract , references , index terms

We present a method for automatically extracting from a running system an indexable 
signature that distills the essential characteristic from a system state and that can be
subjected to automated clustering and similarity-based retrieval to identify when an
observed system state is similar to a previously-observed state. This allows operators to 
identify and quantify the frequency of recurrent problems, to leverage previous diagnostic 
efforts, and to establish whether problems seen at dif ... 

Keywords: bayesian networks, clustering, information retrieval, performance objectives, 
signatures 



39 Bug localization: SOBER: statistical model-based bug localization 
Jiawei Han, Samuel P. Midkiff 

September 2005 Proceedings of the 10th European software engineering conference 
held jointly with 13th ACM SIGSOFT international symposium on 
Foundations of software engineering ESEC/FSE-13 
Publisher: ACM Press 

Full text available: pdf(214.23 KB) Additional Information: full citation , abstract , references , index terms

Automated localization of software bugs is one of the essential issues in debugging aids. 
Previous studies indicated that the evaluation history of program predicates may disclose 
important clues about underlying bugs. In this paper, we propose a new statistical model- 
based approach, called SOBER, which localizes software bugs without any prior knowledge 
of program semantics. Unlike existing statistical debugging approaches that select 
predicates correlated with program failures, SOBER mo ... 

Keywords: localization metrics, statistical debugging